Description

Background & Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers leaving credit card services would lead the bank to a loss, so the bank wants to analyze customer data, identify the customers who will leave their credit card services, and understand the reasons why, so that it can improve in those areas.

You, as a Data Scientist at Thera Bank, need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

You need to identify the best possible model that will give the required performance

Objective

Explore and visualize the dataset.

Build a classification model to predict if the customer is going to churn or not

Optimize the model using appropriate techniques

Generate a set of insights and recommendations that will help the bank

Data Dictionary:

CLIENTNUM: Client number. Unique identifier for the customer holding the account

Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"

Customer_Age: Age in Years

Gender: Gender of the account holder

Dependent_count: Number of dependents

Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate.

Marital_Status: Marital Status of the account holder

Income_Category: Annual Income Category of the account holder

Card_Category: Type of Card

Months_on_book: Period of relationship with the bank

Total_Relationship_Count: Total no. of products held by the customer

Months_Inactive_12_mon: No. of months inactive in the last 12 months

Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months

Credit_Limit: Credit Limit on the Credit Card

Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance

Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)

Total_Trans_Amt: Total Transaction Amount (Last 12 months)

Total_Trans_Ct: Total Transaction Count (Last 12 months)

Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter

Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter

Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Importing Libraries

Data Overview

Looking at the first and last few rows of the dataset, we see the Blue card occurring most often in the Card Category column. We also see missing values that will need treating.

Looking at 10 random rows, we again see the Blue card occurring the most.

• The average Customer Age is 45 with an average Dependent Count of 2.

• Customer Age ranges from 26 to 73 while the Dependent Count ranges from 0 to 5.

• The average Customer Relationship with the bank is 3 years; the max being about 5 years and the minimum being about 1 year.

• On average each customer holds about 4 products with the bank; the minimum being 1 and the max being 6.

From the information available we can assume more customers kept credit card accounts than closed them.

Looking at the information above, some variables need to be changed from "object" to "category". The "CLIENTNUM" column is only an identifier and adds nothing to the analysis, so we will drop it now. We also have missing values in "Education Level" and "Marital Status"; we will deal with these and the data type changes later.
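The clean-up steps described above could look roughly like the sketch below. The small DataFrame is a made-up stand-in for the real data (which would be loaded from the project's file), and the list of columns to convert is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the real dataset.
df = pd.DataFrame({
    "CLIENTNUM": [1001, 1002, 1003, 1004],
    "Attrition_Flag": ["Existing Customer", "Attrited Customer",
                       "Existing Customer", "Existing Customer"],
    "Education_Level": ["Graduate", np.nan, "High School", "Graduate"],
    "Marital_Status": ["Married", "Single", np.nan, "Married"],
    "Customer_Age": [45, 52, 38, 61],
})

# CLIENTNUM is only an identifier, so it is dropped before analysis.
df = df.drop(columns=["CLIENTNUM"])

# Convert the text columns to the category dtype (column list is illustrative).
cat_cols = ["Attrition_Flag", "Education_Level", "Marital_Status"]
df = df.astype({col: "category" for col in cat_cols})

# Count missing values per column to see what needs imputing later.
missing_counts = df.isnull().sum()
print(df.dtypes)
print(missing_counts[missing_counts > 0])
```

Converting after the drop keeps the identifier out of every later step, including the model pipeline.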

• Most of the customers in the dataset are female.
• The most frequent Education Level is Graduate.
• Most customers are married; followed by Single, then Divorced.
• Most customers make less than $40k annually.
• The Blue Card is held the most by customers

• We see that Customer Age has relatively few unique values. This may indicate that many customers share the same age and/or fall within a narrow range.

Univariate analysis

• Customer Age is normally distributed with very few outliers. The average age is about 46. We will drop the two outlier rows, as they are very far from the rest of the data.

• Dependent Count is normally distributed with no outliers. The average is around 2.

• The number of products held by each customer is slightly skewed to the left, with no outliers. The average is around 4.

• The balance carried from month to month is about 1,125 on average, with a low of 0 and a high of about 2,500.

• Credit Limit is highly skewed to the right with many outliers on the higher end. The average Credit Limit is about 8,000 and ranges from 0 to about 35,000. We will not treat outliers here as they are values that can actually occur in this type of dataset.

• The average available credit is 7500, ranging from 0 to about 35,000. This variable is also highly skewed to the right. We will take a closer look at the outliers below to decide how to best deal with them.

• The average transaction amount is around 4,375, ranging from 0 to about 17,500. Outliers will not be treated here, as such values could occur in this kind of dataset.

• The average total customer transaction count over the last 12 months is about 65. This data is almost normally distributed with two outliers; we will drop these as they are low in number and far away from other values.

• On average customers utilize about 3 percent of their spending limit; ranging from 0 to 100 percent. We see this variable is heavily skewed to the right.

• Customers have been with the bank for about 36 months on average, ranging from 1 month to about 56 months. We see outliers on both ends of the distribution, but these could occur in the real world.

• Total Amount change is almost normally distributed with some outliers on the high and low ends. We will look further into these.

• Above we see all outlier customers for Total_Amt_Chng_Q4_Q1 are existing customers.

• Total Transaction Count Change Q4 Q1 is normally distributed with outliers on both ends of the spectrum. We will take a closer look at these.

• Above, most outlier customers for Total_Ct_Chng_Q4_Q1 are existing customers.
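A check like the one above can be sketched as follows, assuming outliers are defined by the usual box-plot rule (values beyond 1.5 times the interquartile range from the quartiles). The helper name `iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outliers(frame, column, k=1.5):
    """Return the rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    mask = (frame[column] < q1 - k * iqr) | (frame[column] > q3 + k * iqr)
    return frame[mask]

# Made-up values: one clearly extreme ratio among otherwise similar ones.
toy = pd.DataFrame({
    "Total_Ct_Chng_Q4_Q1": [0.6, 0.7, 0.65, 0.72, 0.68, 0.71, 0.66, 3.5],
    "Attrition_Flag": ["Existing Customer"] * 6
                      + ["Attrited Customer", "Existing Customer"],
})

outliers = iqr_outliers(toy, "Total_Ct_Chng_Q4_Q1")

# Break the outlier rows down by the target to see which class they belong to.
print(outliers["Attrition_Flag"].value_counts())
```

If the outliers sit almost entirely in one class, dropping them would remove information about that class, which is a reason to inspect them before deciding on treatment.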

Most customers who were inactive with the bank went three months with no activity, with zero being the lowest and six the highest.

Observation of Attrition Flag

The bar graph above again shows that more customers in this dataset opted to keep Credit Card services than not.

• The imbalanced target variable ‘Attrition Flag’ is noted. This will be adjusted in two models we build later.

Observation on Gender

• There are slightly more women than men in the dataset.

Observation on Dependent Count

• Most customers in the dataset have between two and three dependents.

Observation on Education Level

• As stated before, most customers have an education level of Graduate, followed by High School, then Undergraduate. Undergraduate and High School are on opposite ends of the education spectrum; this may contribute to some of the skewed data we saw earlier.

• Most customers are either married or single; few are divorcees.

Observation on Income

Customers making less than 40k a year significantly outnumber those at the other income levels. Lower-income customers holding cards more than higher-income customers may indicate a need for credit to cover expenses when cash is low. We also see the value 'abc'; we will treat it as missing for now and deal with it later.

Customers with the ‘abc’ mislabeling in the Income Category are Blue Card holders. It may be safe to say these customers make less than 40k a year, given that the Blue Card is held the most and most customers make less than 40k a year. We will let the imputer deal with this later to preserve the accuracy of the models built.
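As a rough sketch of this step, the placeholder can be inspected and then marked as missing so that a later imputer picks it up. The income labels and rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows; the income labels are illustrative, not taken from the data.
income = pd.DataFrame({
    "Income_Category": ["Less than $40K", "abc", "$60K - $80K",
                        "abc", "Less than $40K"],
    "Card_Category": ["Blue", "Blue", "Silver", "Blue", "Blue"],
})

# Which cards do the 'abc' rows hold?
is_abc = income["Income_Category"] == "abc"
abc_cards = income.loc[is_abc, "Card_Category"].value_counts()
print(abc_cards)

# Mark the placeholder as missing (NaN) so it is handled with the other gaps.
income["Income_Category"] = income["Income_Category"].mask(is_abc)
print(income["Income_Category"].isna().sum())
```

Leaving the actual fill to the imputer means the replacement value is learned from the training split only.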

Observation on Card Category

• The Blue Card is overwhelmingly the card of choice based on the data given; all others trail far behind. The other cards may need incentive awards and/or point-to-price programs put in place to generate more sales. We can assume these cards are held by customers with higher incomes, based on previous insights. Marketing needs to figure out how to target and convert these customers.

Observation on Total Relationship Count

• Most customers own between 3 and 6 products with the bank; 3 being the most frequent.

Observation on Months Inactive

• Two and three months are the most common lengths of customer inactivity.

Observation on Contact Count

• Most customers were in contact with the bank 2 to 3 times in the last 12 months.

Bivariate Analysis

• The pair plot shows that as ‘customer age’ increases, so does ‘months on book’.

• We will continue with more Bivariate Analysis to gain more insight.

• Married customers spent more than both Single and Divorced customers. This may point to having more than one income in the household.

• This graph shows us that existing customers spent more on the credit cards than former card holders.

• Customers with an education level of High School and Graduate have the highest transaction amounts; surprising being that these two levels are furthest apart.

• There is some variation in Total Transaction Amount among the incomes; but nothing too significant; 60K-80K being the highest.

• Blue Card holders have the highest Total Transaction Amount, followed by Silver Card holders.

• Customers that own four products with the bank have the highest Total Transaction Amount, followed by customers that own three products.

• In regard to Customer Age, customers between the ages of 40 and about 55 have the highest transaction amounts.

• In regard to Dependent Count; customers with two or three dependents have the highest Transaction Amounts.

• Looking at Credit Limit and Customer Age, customers between about 40 and 50 have the highest Credit Limits.

• Looking at Dependent Count and Credit Limit, customers with two or three dependents have the highest Credit Limits.

• Customers with more Months on Book are more likely to use Credit Card services.

• Customers that own more products with the bank are more likely to utilize credit card services. Maybe the bank should try to advertise more products to customers that hold only one or two products with the bank.

• Customers that spend more per transaction are more likely to use Credit Card services.

• Customers that perform more transactions are more likely to use Credit Card services.

• We can see a negative relationship between Education Level and the likelihood of remaining a Credit Card holder. The higher the Education Level, the less likely.

• We also see a negative relationship between Income Category and the likelihood of remaining a Credit Card holder. The higher the income, the less likely, though the variation isn't significant. Based on the last two bar charts, the bank needs to offer awards and other Credit Card incentives to gain high-earning customers.

• Silver and Blue card holders utilize credit card services more so than Gold and Platinum customers.

• Single customers are slightly less likely to be card holders than Divorced and Married customers, although the variation is very small.

• This graph shows that customers active the most with the bank are more likely to leave Credit Card Services. One logical explanation for this could be poor customer service. Another could be dissatisfaction with some aspect of the Credit Card terms and conditions. Management needs to drill down into this and find a root cause.

• Surprisingly, this graph shows that increased customer contact with the bank is associated with customers leaving Credit Card Services. The Customer Service Department may need to be retrained on how to interact with customers.

• Customers that own more products with the bank are more likely to be card holders.

• Dependent Count doesn’t show any significant variation in determining a potential card holder. All values are very close here.

• Men are slightly more likely to opt for Credit Card services.

• Customer Age and Months on Book have a positive correlation. This is expected; the older people get the less likely they are to go through the hassle of changing banks.

• Credit Limit and Average Open To Buy have a positive correlation; not surprising, as higher credit limits don’t necessarily mean more spending. In most cases higher credit limits indicate smart spending habits.

• Total Transaction Amount and Total Transaction Count have a positive correlation. It’s expected that customers who make more transactions are likely to spend more per transaction, because they likely use their card for small and major purchases alike. High frequency use of a credit card will result in transaction limits on both ends of the cost spectrum.

• Average Utilization Ratio and Credit Limit are negatively correlated. Customers that keep a revolving balance of more than 30 percent on their cards are less likely to receive credit line increases.

• Average Utilization Ratio and Average Open to Buy are negatively correlated.

• Total Revolving Balance and Average Utilization Ratio are positively correlated.
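One possible reading of the last few correlations is that these columns are arithmetically linked. The sketch below assumes (this is not stated in the data dictionary) that Avg_Open_To_Buy is roughly Credit_Limit minus Total_Revolving_Bal and that Avg_Utilization_Ratio is roughly Total_Revolving_Bal divided by Credit_Limit. On simulated values built that way, the correlation signs match the ones observed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Simulated values; the ranges are arbitrary and only for illustration.
credit_limit = rng.uniform(2000, 30000, n)
revolving_bal = rng.uniform(0, 2000, n)

sim = pd.DataFrame({
    "Credit_Limit": credit_limit,
    "Total_Revolving_Bal": revolving_bal,
    # Assumed relationships between the columns (not confirmed by the source).
    "Avg_Open_To_Buy": credit_limit - revolving_bal,
    "Avg_Utilization_Ratio": revolving_bal / credit_limit,
})

corr = sim.corr()
print(corr.round(2))
```

If the real columns are linked this way, some of them carry nearly the same information, which may matter when interpreting feature importances.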

Data Preparation for Modeling

Missing-Value Treatment

• We will impute missing values using the most frequent values.
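A minimal sketch of most-frequent imputation with scikit-learn's `SimpleImputer` is shown here. The toy frames and the train/validation naming are assumptions; the point is that the imputer is fitted on the training split only and then reused on the other splits.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cols = ["Education_Level", "Marital_Status"]

# Made-up training and validation rows with gaps.
X_train = pd.DataFrame({
    "Education_Level": ["Graduate", "Graduate", "High School", np.nan],
    "Marital_Status": ["Married", np.nan, "Single", "Married"],
}, dtype="object")
X_val = pd.DataFrame({
    "Education_Level": [np.nan, "Doctorate"],
    "Marital_Status": ["Single", np.nan],
}, dtype="object")

imputer = SimpleImputer(strategy="most_frequent")

# Learn the most frequent value per column from the training data only.
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=cols)

# Apply the same learned values to the validation data.
X_val_imp = pd.DataFrame(imputer.transform(X_val), columns=cols)
print(X_val_imp)
```

Fitting on the training split alone avoids leaking information from the validation and test sets into the model.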

Model Evaluation Criterion

Potential Wrong Predictions

  1. Predicting a customer won’t utilize Credit Card services and they do.
  2. Predicting a customer will utilize Credit Card services and they don’t.

Most Important Case

• Predicting that a customer will not utilize Credit Card services when they do. The company loses a potential source of income because that customer will not be targeted by the marketing team when he/she should be.

To reduce this loss we will have to reduce False Negatives in our models.

The company wants recall to be maximized; we need to reduce the number of false negatives to achieve this.
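Recall is the share of actual positives that the model catches: TP / (TP + FN). The sketch below uses made-up labels, assuming label 1 marks the positive class (the outcome the bank most wants to catch), and shows that fewer false negatives means higher recall.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall by hand; it should agree with scikit-learn's recall_score.
recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)
print(f"TP={tp}, FN={fn}, recall={recall_manual:.2f}")
```

Here one positive out of four is missed, so recall is 3 / 4 = 0.75; the single false positive does not affect it.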

Model Building

• Xgboost has the highest Cross-Validation recall score, followed by Gradient Boost, then Random Forest. These models generalized well on the Validation set as well.
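A comparison like this can be sketched with stratified cross-validation scored on recall. The data below is synthetic with arbitrary class weights, the hyperparameters are placeholders, and XGBoost is left out only to keep the sketch free of extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training split.
X_train, y_train = make_classification(
    n_samples=400, n_features=8, weights=[0.84, 0.16], random_state=1
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=30, random_state=1),
    "Gradient Boost": GradientBoostingClassifier(n_estimators=30, random_state=1),
}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

cv_recall = {}
for name, estimator in models.items():
    scores = cross_val_score(estimator, X_train, y_train, scoring="recall", cv=cv)
    cv_recall[name] = scores.mean()
    print(f"{name}: mean CV recall = {cv_recall[name]:.3f}")
```

Comparing the mean cross-validation recall with the recall on a held-out validation set is one way to judge whether a model generalizes.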

Model building - Up Sample

Oversampling train data using SMOTE

• Like the first model we built, Xgboost has the highest Recall Score when using up-sampling, followed by Random Forest and Gradient Boost.

• This model also generalized well on the Validation set.
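To illustrate the idea behind SMOTE, here is a simplified, hand-written sketch of its core step: each synthetic minority row is placed on the line between a real minority row and one of its nearest minority neighbours. This is not the imbalanced-learn implementation, and the function name `smote_like_oversample` is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating toward neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbours = nn.kneighbors(X_min)
    # Pick a random base row and one of its k true neighbours (columns 1..k).
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(1, k + 1, size=n_new)]
    # Move a random fraction of the way from the base row to the neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_majority = rng.normal(0.0, 1.0, size=(50, 2))   # made-up majority rows
X_minority = rng.normal(3.0, 1.0, size=(10, 2))   # made-up minority rows

# Generate enough synthetic rows to match the majority class size.
synthetic = smote_like_oversample(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_upsampled = np.vstack([X_minority, synthetic])
print(X_majority.shape, X_minority_upsampled.shape)
```

Oversampling should be applied to the training data only, so that the validation and test sets keep the original class balance.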

Model building - Down Sample

• Xgboost again is showing the highest Recall Score for down-sampling, followed closely by Gradient Boost and Random Forest.

• This model also generalized well on the Validation set.
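Down-sampling can be sketched as random under-sampling, assuming that is the approach used here: rows are drawn from each class until every class is as small as the minority class. The toy frame and the `target` column name are made up.

```python
import pandas as pd

# Made-up training frame: 4 positive rows and 16 negative rows.
train = pd.DataFrame({
    "Total_Trans_Ct": range(20),
    "target": [1] * 4 + [0] * 16,
})

# Size of the smallest class.
n_minority = train["target"].value_counts().min()

# Randomly keep n_minority rows from each class.
balanced = train.groupby("target").sample(n=n_minority, random_state=1)
print(balanced["target"].value_counts())
```

Under-sampling discards majority rows, so it trades some information for a balanced training set.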

Model Tuning

• We will tune our Random Forest, Gradient Boost, and Xgboost as they have performed the highest and generalized well on Training and Validation sets.

Random Forest

GridSearchCV

• The Random Forest generalized well on Training and Validation sets with Grid Search

RandomizedSearchCV

• The Random Forest also generalized well with Random Search tuning

• The scores are very close to that of the Grid Search
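The two tuning approaches can be sketched side by side as below. Grid search tries every combination in the grid, while random search samples a fixed number of combinations from it. The data is synthetic and the parameter grid is a placeholder, not the grid used in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic, imbalanced stand-in for the training split.
X_train, y_train = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=1
)

# Placeholder grid: 2 * 3 * 2 = 12 combinations.
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}
forest = RandomForestClassifier(random_state=1)

# Exhaustive search over all 12 combinations, scored on recall.
grid = GridSearchCV(forest, param_grid, scoring="recall", cv=3)
grid.fit(X_train, y_train)

# Random search over only 4 sampled combinations, scored on recall.
rand = RandomizedSearchCV(
    forest, param_grid, n_iter=4, scoring="recall", cv=3, random_state=1
)
rand.fit(X_train, y_train)

print("Grid search:  ", grid.best_params_, round(grid.best_score_, 3))
print("Random search:", rand.best_params_, round(rand.best_score_, 3))
```

Random search is cheaper because it fits fewer candidates, which is why its scores can land close to the grid search scores at a fraction of the run time.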

Gradient Boost

GridSearchCV

• Gradient Boost generalized well on Training and Validation sets; does a good job with predictions

• The scores aren’t too far off from those of the Random Forest.

RandomizedSearchCV

• Gradient Boost does better predicting with Random Search

• Generalized well on both sets. This model is showing the most promise so far.

Xgboost

GridSearchCV

• Xgboost with Grid Search generalized well.

• Doesn’t perform as well as Gradient Boosting.

RandomizedSearchCV

• Xgboost with Random Search is virtually the same as with Grid search

• We will not be using this model, as Gradient Boost proves better as of now.

Comparing all models

Gradient Boosting with Random Search has proved to be the best model. It generalizes well on the Training and Validation data. This will be our model of choice.

Performance on the test set

Pipelines for productionizing the model

We will create 2 different pipelines, one for numerical columns and one for categorical columns

For numerical columns, we will do missing value imputation as pre-processing

For categorical columns, we will do one hot encoding and missing value imputation as pre-processing

We are doing missing value imputation for the whole data, so that if there is any missing value in the data in future that can be taken care of.
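The two pre-processing pipelines described above can be combined with a `ColumnTransformer`, as in the sketch below. The column lists, the median strategy for numerical columns, the hyperparameters, and the toy data are assumptions for illustration; Gradient Boosting is used as the final step because it is the model of choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["Total_Trans_Ct", "Total_Revolving_Bal"]   # illustrative subset
categorical_cols = ["Gender", "Card_Category"]             # illustrative subset

# Numerical columns: impute missing values (median is an assumed strategy).
numeric_pipe = Pipeline([("impute", SimpleImputer(strategy="median"))])

# Categorical columns: impute with the most frequent value, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", GradientBoostingClassifier(n_estimators=20, random_state=1)),
])

# Made-up data with a couple of gaps, just to exercise the pipeline.
rng = np.random.default_rng(1)
n = 60
X = pd.DataFrame({
    "Total_Trans_Ct": rng.integers(10, 120, n).astype(float),
    "Total_Revolving_Bal": rng.integers(0, 2500, n).astype(float),
    "Gender": pd.Series(rng.choice(["F", "M"], n), dtype="object"),
    "Card_Category": pd.Series(rng.choice(["Blue", "Silver"], n), dtype="object"),
})
X.loc[0, "Total_Trans_Ct"] = np.nan
X.loc[1, "Card_Category"] = np.nan
y = (X["Total_Trans_Ct"].fillna(60) < 45).astype(int)   # made-up target

model.fit(X, y)

# A future row with a missing number and an unseen category still gets scored.
new_customer = X.iloc[[0]].copy()
new_customer.loc[0, "Card_Category"] = "Platinum"
pred = model.predict(new_customer)
print(pred)
```

Keeping imputation and encoding inside the pipeline means new data passes through exactly the same steps that were fitted on the training data.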

Data is ready; splitting into two parts.

Business Recommendations

• The bank needs to target customers with high Transaction Counts, Transaction amounts, and Revolving balances. These customers spend more money and make more purchases. Incentives for card usage would be a good way to capture new customers and retain current ones.

• We also observed earlier that we are missing opportunities with our high-earning bank members. These customers need targeted advertising offering them exclusive deals and incentives to use Credit Card services.

• We need to find a way to increase the sale of our other products. We saw in the analysis that customers who owned more products were more likely to use Credit Card services. Generating sales in this way would boost profits.

• A major issue we need to fix is the relationship between customer contact with the bank and attrition. The data shows that customers who contact the bank more often are more likely to leave credit card services. Training needs to be done with customer service personnel, and the Credit Cards need to be looked at to see what fees and terms are leading to customer attrition.

• Customers most likely to leave Credit Card services: those in contact with the bank most often, those that own fewer products with the bank besides credit cards, those with higher income, and those with higher education.